information retrieval
A supplementary for the paper Falconn++: A Locality-sensitive Filtering Approach for Approximate Nearest Neighbor Search
We define $\mu = \mu_1 - \mu_2 > 0$ and set the threshold $t = \mu_1 = (1 - r^2/2)\sqrt{2 \ln D}$. Since $\mu/\sigma_2$ is monotonic with respect to $c$, farther points have a higher probability of being discarded. Therefore, the second property holds for any far-away point $y$, i.e. $\|y - q\| \ge cr$. The first property holds for any close point $x$, i.e. $\|x - q\| \le r$, since its projection value onto $r_1$ follows a Gaussian distribution with mean at least $\mu_1$. Figure 1 shows the recall-speed comparison between Falconn++ and recent theoretical LSF frameworks [2, 3]. All 3 data sets use $L = 100$, $\alpha = \{0.1, 0.5\}$,
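The thresholding argument above can be checked numerically. The sketch below is illustrative only and rests on stated assumptions: all points and the query are unit vectors, the query's projection onto its closest random vector is approximated by sqrt(2 ln D), and a point at Euclidean distance d from the query then has a Gaussian projection with mean (1 - d^2/2) sqrt(2 ln D) and variance d^2 (1 - d^2/4). The function names and the sample values r = 0.5 and D = 256 are hypothetical, not taken from the paper.

```python
import math

def gaussian_tail(t, mean, std):
    """P(X >= t) for X ~ N(mean, std^2)."""
    return 0.5 * math.erfc((t - mean) / (std * math.sqrt(2.0)))

def projection_params(dist, D):
    """Mean and std of the projection of a unit vector at distance `dist`
    from the unit query, under the sqrt(2 ln D) approximation (assumption)."""
    top = math.sqrt(2.0 * math.log(D))   # approximate projection of q itself
    cos_sim = 1.0 - dist ** 2 / 2.0      # inner product of unit vectors at this distance
    mean = cos_sim * top
    std = math.sqrt(1.0 - cos_sim ** 2)  # equals dist * sqrt(1 - dist^2 / 4)
    return mean, std

def keep_probability(dist, r, D):
    """Probability that a point at distance `dist` passes the threshold t = mu_1."""
    t, _ = projection_params(r, D)       # threshold is the mean for distance exactly r
    mean, std = projection_params(dist, D)
    return gaussian_tail(t, mean, std)

r, D = 0.5, 256
p_near = keep_probability(r, r, D)        # point at distance exactly r
p_far_15 = keep_probability(1.5 * r, r, D)  # c = 1.5
p_far_20 = keep_probability(2.0 * r, r, D)  # c = 2.0
print(p_near, p_far_15, p_far_20)
```

Under these assumptions a point at distance exactly r is kept with probability one half, and the keep probability shrinks as c grows, which matches the monotonicity claim in the text.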
Worst-case Performance of Popular Approximate Nearest Neighbor Search Implementations: Guarantees and Limitations
Graph-based approaches to nearest neighbor search are popular and powerful tools for handling large datasets in practice, but they have limited theoretical guarantees. We study the worst-case performance of recent graph-based approximate nearest neighbor search algorithms, such as HNSW, NSG and DiskANN. For DiskANN, we show that its "slow preprocessing" version provably supports approximate nearest neighbor search queries with constant approximation ratio and poly-logarithmic query time, on data sets with bounded "intrinsic" dimension. For the other data structure variants studied, including DiskANN with "fast preprocessing", HNSW and NSG, we present a family of instances on which the empirical query time required to achieve a "reasonable" accuracy is linear in instance size. For example, for DiskANN, we show that the query procedure can take at least 0.1n steps on instances of size n before it encounters any of the 5 nearest neighbors of the query.
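To make the notion of "steps" concrete, here is a minimal sketch of greedy beam search over a proximity graph, the general query style shared by graph-based indexes. It is not the HNSW, NSG, or DiskANN implementation, and the path-graph instance is a toy chosen for illustration, not the hard instance family constructed in the paper. The names `greedy_search`, `beam_width`, and `path_graph` are hypothetical.

```python
def greedy_search(graph, points, query, entry, beam_width=1):
    """Greedy beam search: repeatedly expand the closest unexpanded candidate.
    Returns (best node found, number of expanded nodes)."""
    def dist(i):
        return sum((a - b) ** 2 for a, b in zip(points[i], query))

    visited = set()
    candidates = [(dist(entry), entry)]
    steps = 0
    while True:
        unexpanded = [c for c in candidates if c[1] not in visited]
        if not unexpanded:
            break  # every node in the beam has been expanded
        _, node = min(unexpanded)
        visited.add(node)
        steps += 1
        seen = {c[1] for c in candidates}
        for nb in graph[node]:
            if nb not in seen and nb not in visited:
                candidates.append((dist(nb), nb))
                seen.add(nb)
        # keep only the beam_width closest candidates
        candidates = sorted(candidates)[:beam_width]
    return candidates[0][1], steps

def path_graph(n):
    """Points 0..n-1 on a line, each linked only to its immediate neighbors."""
    points = [(float(i),) for i in range(n)]
    graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
    return graph, points

n = 50
graph, points = path_graph(n)
query = (float(n - 1),)
slow = greedy_search(graph, points, query, entry=0)

# Adding one long-range edge from the entry point changes the picture.
graph_shortcut = {k: list(v) for k, v in graph.items()}
graph_shortcut[0].append(n - 1)
fast = greedy_search(graph_shortcut, points, query, entry=0)
print(slow, fast)
```

On the plain path the search expands every node before reaching the query's nearest neighbor, while a single long-range edge cuts this to two expansions. The toy only shows that step counts depend heavily on graph structure; the paper's lower bounds concern the graphs these algorithms actually build.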
Learning List-Level Domain-Invariant Representations for Ranking
Domain adaptation aims to transfer the knowledge learned on (data-rich) source domains to (low-resource) target domains, and a popular method is invariant representation learning, which matches and aligns the data distributions on the feature space. Although this method is studied extensively and applied to classification and regression problems, its adoption in ranking problems is sporadic, and the few existing implementations lack theoretical justification. This paper revisits invariant representation learning for ranking. Upon reviewing prior work, we found that it implements what we call item-level alignment, which aligns the distributions of the items being ranked from all lists in aggregate but ignores their list structure. However, the list structure should be leveraged, because it is intrinsic to ranking problems, where the data and the metrics are defined and computed on lists, not on the items by themselves. To close this discrepancy, we propose list-level alignment, that is, learning domain-invariant representations at the higher level of lists. The benefits are twofold: it leads to the first domain adaptation generalization bound for ranking, in turn providing theoretical support for the proposed method, and it achieves better empirical transfer performance for unsupervised domain adaptation on ranking tasks, including passage reranking.
SOAR: Improved Indexing for Approximate Nearest Neighbor Search
This paper introduces SOAR: Spilling with Orthogonality-Amplified Residuals, a novel data indexing technique for approximate nearest neighbor (ANN) search. SOAR extends previous approaches to ANN search, such as spill trees, that utilize multiple redundant representations while partitioning the data to reduce the probability of missing a nearest neighbor during search. Rather than training and computing these redundant representations independently, however, SOAR uses an orthogonality-amplified residual loss, which optimizes each representation to compensate for cases where other representations perform poorly. This drastically improves the overall index quality, resulting in state-of-the-art ANN benchmark performance while maintaining fast indexing times and low memory consumption.
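The idea of a spilled assignment that compensates for the primary one can be sketched as follows. The abstract does not state the loss formula, so the form used here is an assumption made for illustration: the squared norm of the secondary residual plus `lam` times the squared norm of its component parallel to the primary residual. The weight `lam`, the function names, and the 2D example are all hypothetical.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def spill_assign(x, centers, lam):
    """Return (primary, secondary) center indices for point x.
    Primary: nearest center. Secondary: minimizes the assumed loss
    ||r2||^2 + lam * ||proj_r(r2)||^2, where r is the primary residual."""
    residuals = [sub(x, c) for c in centers]
    primary = min(range(len(centers)), key=lambda i: dot(residuals[i], residuals[i]))
    r = residuals[primary]
    r_norm_sq = dot(r, r)

    def loss(i):
        r2 = residuals[i]
        # squared length of r2's component along r (zero if r is the zero vector)
        parallel_sq = dot(r, r2) ** 2 / r_norm_sq if r_norm_sq > 0 else 0.0
        return dot(r2, r2) + lam * parallel_sq

    others = [i for i in range(len(centers)) if i != primary]
    secondary = min(others, key=loss)
    return primary, secondary

x = (1.0, 0.0)
centers = [(0.5, 0.0),   # nearest center, residual along the x-axis
           (0.4, 0.0),   # second nearest, residual parallel to the primary one
           (1.0, 0.7)]   # slightly farther, residual orthogonal to the primary one

plain = spill_assign(x, centers, lam=0.0)    # ordinary second-nearest spilling
amplified = spill_assign(x, centers, lam=1.0)
print(plain, amplified)
```

With `lam = 0` the secondary assignment is simply the second-nearest center, whose residual points the same way as the primary residual and so fails on the same queries. With a positive `lam` the sketch prefers the center whose residual is orthogonal, which is the kind of compensation the abstract describes.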
Bing is the anti-AI search engine you should be using
PCWorld argues that Bing serves as a superior alternative to AI-heavy search engines by prioritizing human-authored content over automated summaries. AI search engines like Google's AI Mode often hide original sources and provide misleading information, with traffic to publishers dropping significantly.
The Factorization Curse: Which Tokens You Predict Underlie the Reversal Curse and More
Today's best language models still struggle with hallucinations, factually incorrect generations, which impede their ability to reliably retrieve information seen during training. The reversal curse, where models cannot recall information when probed in a different order than was encountered during training, exemplifies limitations in information retrieval. To better understand these limitations, we reframe the reversal curse as a factorization curse: a failure of models to learn the same joint distribution under different factorizations. We more closely simulate finetuning workflows, which train pretrained models on specialized knowledge, by introducing a realistic testbed based on Wikipedia knowledge graphs. Through a series of controlled experiments with increasing levels of realism, including non-reciprocal relations, we find that reliable information retrieval is an inherent failure of the next-token prediction objective used in popular large language models. Moreover, we demonstrate reliable information retrieval cannot be solved with scale, reversed tokens, or even naive bidirectional-attention training. Consequently, various approaches to finetuning on specialized data would necessarily provide mixed results on downstream tasks, unless the model has already seen the right sequence of tokens. Across five tasks of varying levels of complexity, our results uncover a promising path forward: factorization-agnostic objectives can significantly mitigate the reversal curse and hint at improved knowledge storage and planning capabilities.
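The phrase "the same joint distribution under different factorizations" can be illustrated with a tiny exact example. The sketch below builds a hypothetical joint distribution over (city, country) pairs, factorizes it in both orders, and checks that both products recover the same joint. The data and function names are invented for illustration, and the sketch says nothing about how a neural model learns either factorization.

```python
from fractions import Fraction

# Hypothetical joint distribution p(city, country); values are made up.
joint = {
    ("Paris", "France"): Fraction(1, 2),
    ("Lyon", "France"): Fraction(1, 4),
    ("Rome", "Italy"): Fraction(1, 4),
}

def marginal(joint, axis):
    """Sum out the other variable; axis 0 is city, axis 1 is country."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0) + p
    return out

def conditional(joint, given_axis):
    """p(other | given), keyed by the original (city, country) pair."""
    m = marginal(joint, given_axis)
    return {pair: p / m[pair[given_axis]] for pair, p in joint.items()}

p_city = marginal(joint, 0)
p_country = marginal(joint, 1)
p_country_given_city = conditional(joint, 0)   # "forward" direction
p_city_given_country = conditional(joint, 1)   # "reversed" direction

# Two factorizations of the same joint distribution.
forward = {pair: p_city[pair[0]] * p_country_given_city[pair] for pair in joint}
backward = {pair: p_country[pair[1]] * p_city_given_country[pair] for pair in joint}
print(forward == joint == backward)
```

Both products equal the joint exactly, yet the two conditionals are different objects: the country is certain given Paris, while Paris is only one of two cities given France. A learner that fits only the forward conditional has no direct estimate of the reversed one, which is the gap the abstract attributes to next-token prediction.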